mpg dataggplot2 dataset, you need to first load tidyverse or ggplot2
Fuel economy data from 1999 and 2008 for 38 popular models of car
library(tidyverse)
?mpg
head(mpg,3)
## # A tibble: 3 x 11
## manufacturer model displ year cyl trans drv cty hwy fl class
## <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
## 1 audi a4 1.80 1999 4 auto… f 18 29 p comp…
## 2 audi a4 1.80 1999 4 manu… f 21 29 p comp…
## 3 audi a4 2.00 2008 4 manu… f 20 31 p comp…
mpg$class[1:20]
## [1] "compact" "compact" "compact" "compact" "compact" "compact" "compact"
## [8] "compact" "compact" "compact" "compact" "compact" "compact" "compact"
## [15] "compact" "midsize" "midsize" "midsize" "suv" "suv"
ggplot: Barplotsgeom_bar is designed to make it easy to create bar charts that show counts
g <- ggplot(mpg, aes(class))
# Number of cars in each class:
g + geom_bar()
Bar charts are automatically stacked when multiple bars are “defined”" at the same location.
description of drv: f = front-wheel drive, r = rear wheel drive, 4 = 4wd
g + geom_bar(aes(fill = drv))
To have side by side plots for the groups you would like to compare, you need to use the position argument within the geom_ functions. You can either simply assign the name of positioning as a string or assign a position_ function for more details.
g + geom_bar(aes(fill = drv), position = "dodge")
g + geom_bar(aes(fill = drv),position = position_dodge(width = 0.4))
Unfortunately ggplot2 will not calculate means of a variable for you.
If you would like to plot means or any other values other than the count of categories you need to provide them to ggplot() in the following format:
Assuming that outcome is the mean outcome of treatments (trt) a, b and c here.
df <- data.frame(trt = c("a", "b", "c"), outcome = c(2.3, 1.9, 3.2))
df
## trt outcome
## 1 a 2.3
## 2 b 1.9
## 3 c 3.2
Then use geom_col()
ggplot(df, aes(trt, outcome)) +
geom_col()
You can also change colors easily
ggplot(df, aes(trt, outcome)) +
geom_col(fill="steelblue")
ggplot(df, aes(trt, outcome)) +
geom_col(aes(fill=trt))
geom_bar() with continuous dataYou can also use geom_bar() with continuous data, in which case it will show counts at unique locations.
ggplot(iris, aes(Sepal.Length)) + geom_bar()
ggplot: HistogramsHistograms are basically barplots with “bins”.
To construct a histogram,
- “bin” the range of values. What is meant by “bining” is, to divide the entire range of values into a series of intervals.
- Then count how many values fall into each interval.
The bins are usually specified as consecutive, non-overlapping intervals of a variable.
ggplot(iris, aes(Sepal.Length)) + geom_histogram(binwidth = 0.2)
ggplot(iris, aes(Sepal.Length, fill=Species)) +
geom_histogram(binwidth = 0.5)
ggplot(iris, aes(Sepal.Length, fill=Species)) +
geom_histogram(binwidth = 0.5, alpha=0.4)
ggplot(iris, aes(Sepal.Length, fill=Species)) +
geom_histogram(binwidth = 0.5, position = "identity", alpha=0.4)
ggplot: BoxplotsA box plot is a easy way to get some overall idea about the distribution of your variable. Let’s have a quick review on how to read a box plot:
Boxplot (with an interquartile range) and a probability density function (pdf) of a Normal N(0,σ2) Population
ToothGrowth datasethead(ToothGrowth)
## len supp dose
## 1 4.2 VC 0.5
## 2 11.5 VC 0.5
## 3 7.3 VC 0.5
## 4 5.8 VC 0.5
## 5 6.4 VC 0.5
## 6 10.0 VC 0.5
?ToothGrowth
Here’s how the the distribution of tooth length for different regimens look like.
gb <- ToothGrowth %>%
unite(regimen, -len, sep= "_") %>%
ggplot(data=., aes(regimen, len)) +
geom_boxplot(aes(color=regimen))
Are you confused with this code?
We could get the exact same result with the following:
ToothGrowth$regimen <- paste(ToothGrowth$supp, ToothGrowth$dose, sep="_")
gb <- ggplot(data=ToothGrowth, aes(regimen, len)) +
geom_boxplot(aes(color=regimen))
gb
So you see, these dplyr functions just make life a little easier, and potentially your code looks cleaner and easier to read. We’ll talk about that later.
geom_dotplotWhat’s a dotplot?
In a dot plot, the width of a dot corresponds to the bin width (or maximum width, depending on the binning algorithm), and dots are stacked, with each dot representing one observation.
Example:
set.seed(2211) # get the same random variables from the normal distribution
df <- data.frame(x= paste0("time",sort(rep(1:4, 25))),
y= c(rnorm(25, 0, 2),
rnorm(25, 1, 2),
rnorm(25, 2, 2),
rnorm(25, 3, 2)))
head(df)
## x y
## 1 time1 0.4339022
## 2 time1 -1.7532299
## 3 time1 3.1492460
## 4 time1 0.5317880
## 5 time1 2.2812159
## 6 time1 0.6126187
# look at distribution of y
ggplot(data=df, aes(y)) + geom_histogram(binwidth = 1)
# look at distribution of x
ggplot(data=df, aes(x)) + geom_bar()
# now lets look with dotplot
ggplot(data=df, aes(x, y)) +
geom_dotplot(binaxis="y",
binwidth = 0.3,
stackdir = "center")
Change the binwidth to 0.1, 0.5, 1 one by one. Then change the stackdir to “up” (default), “down”, “center”, “centerwhole”.
What’s happening?
geom_dotplotgb + geom_dotplot(binwidth = 0.5,
aes(fill = regimen),
alpha = 0.3,
binaxis = "y",
stackdir = "center")
ggplot2: Plotting means and error barsThe error bars can define multiple types of variation:
Standard deviation (sd) shows how much each data point deviate from the mean “on average”. This is useful if you would like to have a general idea about how the data is distributed.
Standard error of the mean (sem) gives a value explaining how much your estimated mean is affected by the sampling variability. It can be difficult to interpret.
95% Confidence interval (ci95) the interval which you are 95% certain that the true mean falls into. This is probably the best type of variation if we are really interested in estimating the mean.
As mentioned earlier, ggplot2 requires calculated summary statistics to be given to the ggplot() in a data frame form. This is why I keep mentioning dplyr package. It provides nice tools to manipulate data and get some summary statistics out of it.
So, now I will to compute the mean, sd, sem and CI95. To compute sem and CI95 I need to write my own functions…
Now I would like to write my own custom functions to compute some of the statistics.
For that I will use the function() function. For example, I would like to define a function called add5mult3mean that
- Takes a numeric vector
- Adds 5 to all of its elements
- Multiplies with 3
- And finally gives the mean of this modified vector
add5mult3mean <- function(x) mean((x+5)*3)
myVec <- 1:10
add5mult3mean(myVec)
## [1] 31.5
# or
mean((myVec+5)*3)
## [1] 31.5
You can name the argument whatever you want as long as it’s consistent throughout your function.
So the following would again give me the same:
add5mult3mean <- function(a_numeric_vector) mean((a_numeric_vector+5)*3)
add5mult3mean(myVec)
## [1] 31.5
You can add as many arguments as you want and use objects in your environment.
irisChopper <- function(breaks){
chopped.iris <- data.frame(cat.Sep.Len = cut(iris$Sepal.Length, breaks = breaks),
cat.Sep.Wid = cut(iris$Sepal.Width, breaks = breaks),
cat.Pet.Len = cut(iris$Petal.Length, breaks = breaks),
cat.Pet.Wid = cut(iris$Petal.Width, breaks = breaks),
iris$Species)
head(chopped.iris)
}
irisChopper(10)
## cat.Sep.Len cat.Sep.Wid cat.Pet.Len cat.Pet.Wid iris.Species
## 1 (5.02,5.38] (3.44,3.68] (0.994,1.59] (0.0976,0.34] setosa
## 2 (4.66,5.02] (2.96,3.2] (0.994,1.59] (0.0976,0.34] setosa
## 3 (4.66,5.02] (2.96,3.2] (0.994,1.59] (0.0976,0.34] setosa
## 4 (4.3,4.66] (2.96,3.2] (0.994,1.59] (0.0976,0.34] setosa
## 5 (4.66,5.02] (3.44,3.68] (0.994,1.59] (0.0976,0.34] setosa
## 6 (5.38,5.74] (3.68,3.92] (1.59,2.18] (0.34,0.58] setosa
irisChopper(0:10)
## cat.Sep.Len cat.Sep.Wid cat.Pet.Len cat.Pet.Wid iris.Species
## 1 (5,6] (3,4] (1,2] (0,1] setosa
## 2 (4,5] (2,3] (1,2] (0,1] setosa
## 3 (4,5] (3,4] (1,2] (0,1] setosa
## 4 (4,5] (3,4] (1,2] (0,1] setosa
## 5 (4,5] (3,4] (1,2] (0,1] setosa
## 6 (5,6] (3,4] (1,2] (0,1] setosa
Now lets get back to our functions to calculate some statistics. Function sem() computes the standard error of the mean for a numeric vector. Similarly, ci95 function computes the mean confidence interval of the numeric vector given.
# here I am defining a function
# that would compute the standard error of the mean
sem <- function(x) sd(x)/sqrt(length(x))
# here I am defining a function
# that would compute the CI95.
# am computing this with t-distribution
# since I don't know the population standard deviation and
# my sample size for each treatment group is less than 30 --- a general rule
# qt() function is the t-distribution quantile function.
# It basically give the t-statistic value for the quantile given
ci95 <- function(x) qt(0.975,df=length(x)-1)*sd(x)/sqrt(length(x))
If you run these commands, you’ll see that you have defined functions in your environment.
Now let’s apply these to our data using dplyr functions.
ToothGrowthtg.stats <- ToothGrowth %>%
group_by(supp, dose) %>% # group by
summarise(.,
mean.len = mean(len), # mean
sd.len = sd(len), # sd
se.len = sem(len), # sem
ci95.len = ci95(len), # ci95
n = length(len)) %>% # count
right_join(ToothGrowth)
Here I am using the pipe %>% and functions group_by(), summarise(), right_join() from the dplyr package. Let’s not get into details now, but very briefly this is what I am doing:
ToothGrowth %>% I am piping the ToothGrowth data in the function that comes up next. This way I can keep piping things to other functions.
x %>% f(y) is the same as f(x, y)y %>% f(x, ., z) is the same as f(x, y, z )group_by(supp, dose): Grouping it based on its supp and dose columns.
summarize(...): Computing the statistics for every len observation in the respective supp and dose group. Statistics I compute are:
-sd
-sem
-ci95
length (this isn’t really a statistic is just the count of the observation in each group)
right_join(ToothGrowth)I am appending the summary statistics to the original data
Now this is how tg.stats look
tg.stats
## # A tibble: 60 x 8
## # Groups: supp [?]
## supp dose mean.len sd.len se.len ci95.len n len
## <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <int> <dbl>
## 1 VC 0.500 7.98 2.75 0.869 1.96 10 4.20
## 2 VC 0.500 7.98 2.75 0.869 1.96 10 11.5
## 3 VC 0.500 7.98 2.75 0.869 1.96 10 7.30
## 4 VC 0.500 7.98 2.75 0.869 1.96 10 5.80
## 5 VC 0.500 7.98 2.75 0.869 1.96 10 6.40
## 6 VC 0.500 7.98 2.75 0.869 1.96 10 10.0
## 7 VC 0.500 7.98 2.75 0.869 1.96 10 11.2
## 8 VC 0.500 7.98 2.75 0.869 1.96 10 11.2
## 9 VC 0.500 7.98 2.75 0.869 1.96 10 5.20
## 10 VC 0.500 7.98 2.75 0.869 1.96 10 7.00
## # ... with 50 more rows
Finally we can start plotting! Let’s start with line graphs.
Step by step:
ggplot(tg.stats, aes(x=dose, y=mean.len, colour=supp)) +
geom_point(aes(x=dose, y= mean.len), size=4, shape=18)
Oh you’re wondering about the shapes? We’ll come to that…
ggplot(tg.stats, aes(x=dose, y=mean.len, colour=supp)) +
geom_point(aes(x=dose, y= mean.len), size=4, shape=18) +
geom_errorbar(aes(ymin=mean.len-ci95.len, ymax=mean.len+ci95.len), width=.1)
ggplot(tg.stats, aes(x=dose, y=mean.len, colour=supp)) +
geom_point(aes(x=dose, y= mean.len), size=4, shape=18) +
geom_errorbar(aes(ymin=mean.len-ci95.len, ymax=mean.len+ci95.len), width=.1) +
geom_line()
ggplot(tg.stats, aes(x=dose, y=mean.len, colour=supp)) +
geom_point(aes(x=dose, y= mean.len), size=4, shape=18) +
geom_errorbar(aes(ymin=mean.len-ci95.len, ymax=mean.len+ci95.len), width=.1) +
geom_line() +
geom_point(aes(x=dose, y=len, fill=supp), alpha=0.4, size=1.5)
Now that we have more or less the plot we want, let’s customize it further.
ggplot2Modify axis, legend, and plot labels see Reference ggplot2: Scale
ggplot(tg.stats, aes(x=dose, y=mean.len, colour=supp)) +
geom_point(aes(x=dose, y= mean.len), size=4, shape=18) +
geom_errorbar(aes(ymin=mean.len-ci95.len, ymax=mean.len+ci95.len), width=.1) +
geom_line() +
geom_point(aes(x=dose, y=len, fill=supp), alpha=0.3, size=1.5) +
# Here I add the titles I want
labs(x="Supplement Dose",
y="Tooth Length",
title= "The Effect of Vitamin C on Tooth Growth in Guinea Pigs",
subtitle= "Comparing delivery methods: Orange Juice vs Ascorbic Acid",
caption = "Based on the data from R")
Define the limits of axes
ggplot(tg.stats, aes(x=dose, y=mean.len, colour=supp)) +
geom_point(aes(x=dose, y= mean.len), size=4, shape=18) +
geom_errorbar(aes(ymin=mean.len-ci95.len, ymax=mean.len+ci95.len), width=.1) +
geom_line() +
geom_point(aes(x=dose, y=len, fill=supp), alpha=0.3, size=1.5) +
# Here I add the titles I want
labs(x="Supplement Dose",
y="Tooth Length",
title= "The Effect of Vitamin C on Tooth Growth in Guinea Pigs",
subtitle= "Comparing delivery methods: Orange Juice vs Ascorbic Acid",
caption = "Based on the data from R") +
# Change x and y axes limits
xlim(0, 3) +
ylim(0, 60)
We saw in the previous session how to change legend parameters with guides() function. To change the legend title we can use the same function. See (Reference ggplot2 Guides)[http://ggplot2.tidyverse.org/reference/index.html#section-guides-axes-and-legends]
ggplot(tg.stats, aes(x=dose, y=mean.len, colour=supp)) +
geom_point(aes(x=dose, y= mean.len), size=4, shape=18) +
geom_errorbar(aes(ymin=mean.len-ci95.len, ymax=mean.len+ci95.len), width=.1) +
geom_line() +
geom_point(aes(x=dose, y=len, fill=supp), alpha=0.3, size=1.5) +
# Here I add the titles I want
labs(x="Supplement Dose",
y="Tooth Length",
title= "The Effect of Vitamin C on Tooth Growth in Guinea Pigs",
subtitle= "Comparing delivery methods: Orange Juice vs Ascorbic Acid",
caption = "Based on the data from R") +
# Change the legend
guides(colour=guide_legend("Supplement"), fill=guide_legend("Supplement"))
I want to customize the jitter of the points. To adjust positions look at position adjustment in ggplot2 Reference
pd <- position_dodge(0.1)
# my plot is getting very long, I'll assign it
tgp <- ggplot(tg.stats, aes(x=dose, y=mean.len, colour=supp)) +
geom_point(aes(x=dose, y= mean.len), size=4, shape=18, position=pd) +
geom_errorbar(aes(ymin=mean.len-ci95.len, ymax=mean.len+ci95.len), width=.1, position=pd) +
geom_line(position=pd) +
geom_point(aes(x=dose, y=len, fill=supp), alpha=0.3, size=1.5, position=pd) +
# Here I add the titles I want
labs(x="Supplement Dose",
y="Tooth Length",
title= "The Effect of Vitamin C on Tooth Growth in Guinea Pigs",
subtitle= "Comparing delivery methods: Orange Juice vs Ascorbic Acid",
caption = "Based on the data from R") +
# Change the legend
guides(colour=guide_legend("Supplement"), fill=guide_legend("Supplement"))
tgp
Theme allows to make cosmetic changes
tgp + theme_bw()
tgp + theme_minimal()